ABSTRACT 

Code Excited Linear Prediction (CELP) 
speech coders exhibit good performance at data 
rates as low as 4800 bps. The major drawback to 
CELP type coders is their large computational 
requirements. The Vector Sum Excited Linear 
Prediction (VSELP) speech coder utilizes a 
codebook with a structure which allows for a 
very efficient search procedure. Other 
advantages of the VSELP codebook structure 
will be discussed and a detailed description of a 
4.8 kbps VSELP coder will be given. This coder 
is an improved version of the VSELP algorithm, 
which finished first in the NSA's evaluation of 
the 4.8 kbps speech coders [1]. The coder 
employs a subsample resolution single tap long 
term predictor, a single VSELP excitation 
codebook, a novel gain quantizer which is robust 
to channel errors, and a new adaptive 
pre/postfilter arrangement. 

INTRODUCTION 

Vector Sum Excited Linear Prediction falls 
into the class of speech coders known as Code 
Excited Linear Prediction (CELP) (also called 
Vector Excited or Stochastically Excited) 
[2,4,6]. The VSELP speech coder was designed 
to accomplish three goals: 

1. Highest possible speech quality 

2. Reasonable computational complexity 

3. Robustness to channel errors 

These three goals are essential for wide 
acceptance of low data rate (4.8-8 kbps) speech 
coding for telecommunications applications. 

The VSELP speech coder achieves these goals 
through efficient utilization of a structured 
excitation codebook. The structured codebook 
reduces computational complexity and increases 
robustness to channel errors [1,3]. A single 
optimized VSELP excitation codebook is used to 
achieve high speech quality while maintaining 
reasonable complexity. A subsample resolution 
single tap long term predictor noticeably 
improves performance for high pitched speakers. 
A novel gain quantizer is employed which 
achieves high coding efficiency and robustness 
to channel errors. Finally, a new adaptive 
pre/postfilter arrangement is used to enhance the 
quality of the reconstructed speech. 

BASIC CODER STRUCTURE 

Figure 1 is a block diagram of the VSELP 
speech decoder. The 4.8 kbps VSELP 
coder/decoder utilizes two excitation sources. 
The first source is the long term ("pitch") 
predictor state, or adaptive codebook [4]. The 
second is the VSELP excitation codebook. For 
the 4.8 kbps coder, the VSELP codebook 
contains the equivalent of 1024 codevectors. The 
excitation vectors, selected from the two 
excitation sources, are multiplied by their 
corresponding gain terms and summed, to 
become the combined excitation sequence ex(n). 
After each subframe, ex(n) is used to update the 
long term filter state (adaptive codebook). The 
synthesis filter is a direct form 10th order LPC 
all-pole filter. The LPC coefficients are coded 
once per 30 msec frame and updated in each 7.5 
msec subframe through interpolation. The 
excitation parameters are also updated in each 
7.5 msec subframe. The number of samples in a 
subframe, N, is 60 at an 8 kHz sampling rate. 
The "pitch" prefilter and spectral postfilter will 
be discussed below. 
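
The decoder loop described above can be sketched for one subframe as follows. This is a minimal illustration, not the authors' implementation: it assumes an integer lag, extends the long term prediction vector periodically when the lag is shorter than the subframe, takes the synthesis filter as 1 / (1 - sum_i a_i z^-i), and omits the prefilter and postfilter. The function name and argument layout are hypothetical.

```python
def decode_subframe(history, lag, codevector, beta, gamma, lpc, synth_state):
    """Return (speech, new_history, new_synth_state) for one subframe.

    history     : past combined excitation ex(n), oldest first (len >= lag)
    codevector  : selected VSELP codevector u_I(n)
    lpc         : direct form coefficients a_1..a_P (sign convention assumed)
    synth_state : previous synthesis filter outputs, most recent first
    """
    n_sub = len(codevector)
    # Long term prediction vector b_L(n): the excitation 'lag' samples back.
    # For lag < n_sub the vector is extended periodically (an assumption).
    b = []
    for n in range(n_sub):
        b.append(history[len(history) - lag + n] if n < lag else b[n - lag])
    # Combined excitation ex(n) = beta * b_L(n) + gamma * u_I(n).
    ex = [beta * b[n] + gamma * codevector[n] for n in range(n_sub)]
    # After the subframe, ex(n) updates the long term filter state.
    new_history = (history + ex)[-len(history):]
    # All-pole synthesis: s(n) = ex(n) + sum_i a_i * s(n - i).
    state = list(synth_state)
    speech = []
    for x in ex:
        y = x + sum(a * s for a, s in zip(lpc, state))
        speech.append(y)
        state = [y] + state[:-1]
    return speech, new_history, state
```

In the 4.8 kbps coder the subframe length would be 60 samples and `lpc` would hold the 10 interpolated coefficients for that subframe.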

Table 1 shows the bit allocations for the 4.8 
kbps VSELP coder. The 10 LPC coefficients are 
coded using scalar quantization of the reflection 
coefficients. An energy term, Rq(0), which 
represents the average speech energy per frame 
is also coded once per frame. To accommodate 
noninteger values of the long term predictor 
delay, eight bits are used to code L. A polyphase 
FIR interpolating filter generates the excitation 
vectors for noninteger delays [5]. The two 
excitation gains are vector quantized to 7 bits 
(GS-PO code) per subframe. 
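
The bit allocations and frame timing quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text and in Table 1; the variable names are illustrative.

```python
# Frame timing taken from the text: 30 ms frames, four 7.5 ms subframes, 8 kHz.
FRAME_MS = 30
SUBFRAMES_PER_FRAME = 4
SAMPLE_RATE_HZ = 8000

# Bits per frame for each parameter in Table 1.
bits_per_frame = {
    "LPC coefficients": 37,
    "energy Rq(0)": 5,
    "excitation code I": SUBFRAMES_PER_FRAME * 10,
    "lag L": SUBFRAMES_PER_FRAME * 8,
    "GS-PO code": SUBFRAMES_PER_FRAME * 7,
    "synch, parity": 2,
}

total_bits = sum(bits_per_frame.values())
bit_rate_bps = total_bits * 1000 // FRAME_MS
samples_per_subframe = SAMPLE_RATE_HZ * FRAME_MS // (1000 * SUBFRAMES_PER_FRAME)
codevectors = 2 ** 10     # ten bits for the VSELP excitation code
lag_codes = 2 ** 8        # eight bits for L (integer and noninteger delays)
gain_vectors = 2 ** 7     # seven bits for the GS-PO code

print(total_bits, bit_rate_bps, samples_per_subframe)   # 144 4800 60
```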



Figure 1 - VSELP Speech Decoder 

PARAMETER             BITS/SUBFRAME   BITS/FRAME
LPC coefficients            -             37
energy - Rq(0)              -              5
excitation code - I        10             40
lag - L                     8             32
GS-PO code                  7             28
<synch, parity>             -              2
TOTAL                      25            144

Table 1 - Bit Allocations for 4.8 kbps 


VSELP CODEBOOK STRUCTURE 

The coder uses a single VSELP excitation 
codebook, which contains $2^M$ codevectors. 
These are constructed from a set of M basis 
vectors, where M = 10 for the 4.8 kbps coder. 
Defining $v_m(n)$ as the m-th basis vector and $u_i(n)$ 
as the i-th codevector, each from the VSELP 
codebook, then: 

$$u_i(n) = \sum_{m=1}^{M} \theta_{im} v_m(n) \tag{1}$$

where $0 \le i \le 2^M - 1$ and $0 \le n \le N-1$. 

In other words, each codevector in the 
codebook is constructed as a linear combination 
of the M basis vectors. The linear combinations 
are defined by the $\theta$ parameters. $\theta_{im}$ is defined 
as: 

$$\theta_{im} = \begin{cases} +1 & \text{if bit } m \text{ of codeword } i = 1 \\ -1 & \text{if bit } m \text{ of codeword } i = 0 \end{cases}$$

Note that if we complement all the bits in 
codeword i, the corresponding codevector is the 
negative of codevector i. Therefore, for every 
codevector, its negative is also a codevector in 
the codebook. These pairs are called 
complementary codevectors since the 
corresponding codewords are complements of 
each other. 

The excitation codewords for the VSELP 
coder are more robust to bit errors than the 
excitation codewords for random codebooks. A 
single bit error in a VSELP codeword changes 
the sign of only one of the basis vectors. The 
resulting codevector is still similar to the desired 
codevector. 
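
Taking the weight of basis vector m to be +1 when bit m of codeword i is 1 and -1 when it is 0, the construction in (1) can be sketched as below. The sketch also illustrates two consequences of the structure: complementing every bit negates the codevector, and a single bit error changes the codevector by twice one basis vector. The random basis vectors, the example codeword, and the bit numbering from 0 (rather than 1) are assumptions made for illustration.

```python
import random

def codevector(codeword, basis):
    """u_i(n) = sum_m theta_im * v_m(n), with theta_im = +1 if bit m is set, else -1."""
    n_sub = len(basis[0])
    u = [0.0] * n_sub
    for m, v in enumerate(basis):
        theta = 1.0 if (codeword >> m) & 1 else -1.0
        for n in range(n_sub):
            u[n] += theta * v[n]
    return u

random.seed(0)
M, N = 10, 60                      # 10 basis vectors, 60 samples per subframe
basis = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]

i = 0b1011001110                   # arbitrary example codeword
u_i = codevector(i, basis)
u_comp = codevector(i ^ ((1 << M) - 1), basis)   # all bits complemented
u_err = codevector(i ^ (1 << 3), basis)          # single bit error in bit 3

# u_comp is the negative of u_i; u_i - u_err equals 2 * basis[3] because
# bit 3 of i is set, so only the sign of that one basis vector changed.
```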


SELECTION OF EXCITATION 
VECTORS 

Figure 2 is a block diagram which shows the 
process used to select the two codebook indices 
L and I. These excitation parameters are 
computed every subframe. 

Figure 2 - Excitation Codeword Selection 
(input signal label: Weighted Input Minus Zero Input Response of H(z)) 


H(z) is the bandwidth expanded synthesis filter, 
$H(z) = 1/A(z/\lambda)$, where $\lambda$ is the noise weighting 
factor. Signal p(n) is the perceptually weighted 
(with noise weighting factor $\lambda$) input speech for 
the subframe with the zero input response of the 
bandwidth expanded synthesis filter H(z) 
subtracted out [6]. 
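
Bandwidth expansion and the removal of the zero input response can be sketched as follows. The sketch assumes A(z) = 1 - sum_i a_i z^-i, so that the coefficients of the bandwidth expanded filter are the a_i scaled by the i-th power of the noise weighting factor. The helper names are hypothetical and the perceptual weighting of the input speech itself is taken as already done.

```python
def bandwidth_expand(lpc, lam):
    """Coefficients of A(z/lam): a_i is scaled by lam**i for i = 1..order."""
    return [a * lam ** (i + 1) for i, a in enumerate(lpc)]

def all_pole(x, coeffs, state=None):
    """Filter x through 1 / (1 - sum_i coeffs[i-1] z^-i).

    state holds previous outputs, most recent first; returns (y, final state).
    """
    state = list(state) if state is not None else [0.0] * len(coeffs)
    y = []
    for sample in x:
        out = sample + sum(c * s for c, s in zip(coeffs, state))
        y.append(out)
        state = [out] + state[:-1]
    return y, state

def target_signal(weighted_speech, h_coeffs, h_state):
    """p(n): weighted input minus the zero input response of H(z)."""
    zir, _ = all_pole([0.0] * len(weighted_speech), h_coeffs, h_state)
    return [w - z for w, z in zip(weighted_speech, zir)]
```

Calling `all_pole` with no initial state gives the zero state response used in the codebook searches.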

The two excitation vectors are selected 
sequentially, one from each of the two excitation 
codebooks (adaptive codebook and VSELP 
codebook). Each codebook search attempts to 
find the codevector which minimizes the total 
weighted error. 

Although the codevectors are chosen 
sequentially, the gain of the excitation vector chosen 
from the adaptive codebook is left "floating". 
The adaptive codebook is searched first, 
assuming gain $\gamma$ is zero. The VSELP codebook 
is searched next, with optimal $\gamma$ and $\beta$ used for 
each codevector being evaluated. This joint 
optimization can be achieved by orthogonalizing 
each weighted (filtered) codevector to the 
weighted excitation vector selected from the 
adaptive codebook, prior to the VSELP 
codebook search. While this task seems 
impractical in general, for the VSELP codebook it 
reduces to orthogonalizing only the weighted 
basis vectors. 
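
The reason only the basis vectors need to be orthogonalized is that the projection is linear: removing the component along the weighted adaptive codebook vector from each filtered basis vector and then forming the signed sum gives the same result as orthogonalizing every filtered codevector individually. A minimal sketch, with hypothetical helper names:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def orthogonalize(q, b_filtered):
    """Remove from q(n) the component correlated with the weighted vector b'_L(n)."""
    energy = dot(b_filtered, b_filtered)
    if energy == 0.0:
        return list(q)
    scale = dot(q, b_filtered) / energy
    return [qn - scale * bn for qn, bn in zip(q, b_filtered)]

def signed_sum(thetas, vectors):
    """sum_m theta_m * vectors[m](n)."""
    return [sum(t * v[n] for t, v in zip(thetas, vectors))
            for n in range(len(vectors[0]))]

# Small example: two filtered basis vectors and a weighted adaptive vector.
b_w = [1.0, 1.0, 0.0]
q = [[2.0, 0.0, 1.0], [0.0, 3.0, -1.0]]
thetas = [1.0, -1.0]

per_basis = signed_sum(thetas, [orthogonalize(v, b_w) for v in q])
per_codevector = orthogonalize(signed_sum(thetas, q), b_w)
# per_basis and per_codevector agree, and both are orthogonal to b_w.
```

For M = 10 this replaces 1024 orthogonalizations with 10.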

The adaptive codebook is searched first for an 
index L which minimizes: 

$$E'_L = \sum_{n=0}^{N-1} \left( p(n) - \beta' b'_L(n) \right)^2 \tag{2}$$

where $b'_L(n)$ is the zero state response of H(z) to 
$b_L(n)$ and where $\beta'$ is optimal for each codebook 
index L. 
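
For a fixed lag, the gain that minimizes (2) is the ratio of the cross-correlation between p(n) and the filtered candidate to the candidate's energy, and the resulting error is the target energy minus the squared correlation divided by that energy. A sketch of the search under those standard least-squares identities follows. It assumes the filtered candidate vectors have already been computed and places no constraint on the sign or range of the gain, which the text does not specify.

```python
def search_adaptive_codebook(p, filtered_candidates):
    """Pick the lag minimizing E'_L of (2) with the optimal gain for each lag.

    filtered_candidates maps lag L -> b'_L(n), the zero state response of
    H(z) to b_L(n). Returns (best lag, optimal gain, error).
    """
    p_energy = sum(x * x for x in p)
    best_lag, best_beta, best_err = None, 0.0, float("inf")
    for lag, b in filtered_candidates.items():
        energy = sum(x * x for x in b)
        if energy == 0.0:
            continue
        corr = sum(x * y for x, y in zip(p, b))
        beta = corr / energy                     # optimal gain for this lag
        err = p_energy - corr * corr / energy    # E'_L evaluated at that gain
        if err < best_err:
            best_lag, best_beta, best_err = lag, beta, err
    return best_lag, best_beta, best_err
```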

To search the VSELP codebook, the zero state 
response of H(z) to each codevector must be 
computed. From the definition of the VSELP 
codebook (1), the filtered codevector $f_i(n)$ can be 
expressed as: 

$$f_i(n) = \sum_{m=1}^{M} \theta_{im} q_m(n) \tag{3}$$

where $q_m(n)$ is the zero state response of H(z) to 
basis vector $v_m(n)$, for $0 \le n \le N-1$. 

The orthogonalized filtered codevectors can 
now be expressed as: 

$$f'_i(n) = \sum_{m=1}^{M} \theta_{im} q'_m(n) \tag{4}$$

for $0 \le i \le 2^M - 1$ and $0 \le n \le N-1$. Thus $q'_m(n)$ 
is $q_m(n)$ with the component correlated to $b'_L(n)$ 
removed. The codebook search procedure now 
finds the codeword i which minimizes: 

$$E'_i = \sum_{n=0}^{N-1} \left( p(n) - \gamma' f'_i(n) \right)^2 \tag{5}$$

where $\gamma'$ is optimal for each codevector i. Once 
we have filtered and orthogonalized the basis 
vectors, the VSELP codebook search is initiated. 
Defining: 

$$C_i = \sum_{n=0}^{N-1} f'_i(n) p(n) \tag{6}$$

and 

$$G_i = \sum_{n=0}^{N-1} \left( f'_i(n) \right)^2 \tag{7}$$

then the codevector which maximizes: 

$$\frac{(C_i)^2}{G_i} \tag{8}$$

is chosen. The search procedure evaluates (8) for 
each codevector. Using properties of the VSELP 
codebook structure, the computations required 
for computing $C_i$ and $G_i$ can be greatly 
simplified. Defining: 

$$R_m = 2 \sum_{n=0}^{N-1} q'_m(n) p(n), \quad 1 \le m \le M \tag{9}$$

and 

$$D_{mj} = 4 \sum_{n=0}^{N-1} q'_m(n) q'_j(n), \quad 1 \le m \le j \le M \tag{10}$$

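$C_i$ can be expressed as: 

$$C_i = \frac{1}{2} \sum_{m=1}^{M} \theta_{im} R_m \tag{11}$$

and $G_i$ is given by: 

$$G_i = \frac{1}{2} \sum_{j=2}^{M} \sum_{m=1}^{j-1} \theta_{im} \theta_{ij} D_{mj} + \frac{1}{4} \sum_{j=1}^{M} D_{jj} \tag{12}$$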

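Assuming that codeword u differs from 
codeword i in only one bit position, say position 
v, such that $\theta_{uv} = -\theta_{iv}$ and $\theta_{um} = \theta_{im}$ for $m \ne v$, 
then: 

$$C_u = C_i + \theta_{uv} R_v \tag{13}$$

and 

$$G_u = G_i + \sum_{j=1}^{v-1} \theta_{uj} \theta_{uv} D_{jv} + \sum_{j=v+1}^{M} \theta_{uj} \theta_{uv} D_{vj} \tag{14}$$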

If the codebook search is structured such that 
each successive codeword evaluated differs from 
the previous codeword in only one bit position, 
then (13) and (14) can be used to update $C_i$ and 
$G_i$ in a very efficient manner. Sequencing of the 
codewords in this manner is accomplished using 
a binary Gray code. 

Note that complementary codewords will have 
equivalent values for (8). Therefore only half of 
the codevectors need to be evaluated. Once the 
codevector which maximizes (8) is found, the 
sign of the corresponding $C_i$ will determine 
whether the selected codevector or its negative 
will yield a positive gain. If $C_i$ is positive then i 
is the selected codeword; if $C_i$ is negative then 
the one's complement of i is selected as the 
codeword. 
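
The search just described can be sketched as follows: the correlation and energy terms are computed once for the all-zero codeword, then updated with (13) and (14) as a Gray code sequence flips one bit at a time over half of the codebook, and the sign of the winning correlation decides between the codeword and its complement. This is an illustrative sketch rather than the authors' implementation. Bits are numbered from 0 here, and the choice of which half of the codebook to visit (top bit held at 0) is an assumption.

```python
def vselp_search(q_orth, p):
    """Return the codeword maximizing (8), using Gray code updates.

    q_orth[m] is the filtered, orthogonalized basis vector q'_m(n); p is the
    target p(n). Bit m of the result set to 1 means theta = +1 for vector m.
    """
    M = len(q_orth)
    R = [2.0 * sum(a * b for a, b in zip(q, p)) for q in q_orth]          # (9)
    D = [[4.0 * sum(a * b for a, b in zip(q_orth[m], q_orth[j]))
          for j in range(M)] for m in range(M)]                            # (10)
    theta = [-1.0] * M                    # codeword 0: every theta is -1
    C = 0.5 * sum(t * r for t, r in zip(theta, R))
    G = (0.5 * sum(theta[m] * theta[j] * D[m][j]
                   for j in range(1, M) for m in range(j))
         + 0.25 * sum(D[j][j] for j in range(M)))
    word = 0
    best_word, best_C = word, C
    best_score = C * C / G if G > 0.0 else 0.0
    # Visit 2**(M-1) codewords; their complements are covered by the sign test.
    for k in range(1, 1 << (M - 1)):
        v = (k & -k).bit_length() - 1     # bit that flips at Gray code step k
        theta[v] = -theta[v]
        word ^= 1 << v
        C += theta[v] * R[v]                                               # (13)
        G += sum(theta[j] * theta[v] * D[j][v]
                 for j in range(M) if j != v)                              # (14)
        score = C * C / G if G > 0.0 else 0.0
        if score > best_score:
            best_word, best_C, best_score = word, C, score
    if best_C < 0.0:                      # negative correlation: take complement
        best_word ^= (1 << M) - 1
    return best_word
```

Each step costs one addition for the correlation and M - 1 multiply-adds for the energy, instead of recomputing two length-N sums per codevector.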

QUANTIZATION OF EXCITATION 
GAINS 

The quantization of the excitation gains 
consists of two stages. The first stage codes the 
average speech energy once per frame. The 
quantized value of this energy, Rq(0), is 
specified with five bits, using 2 dB quantization 
steps for 64 dB of dynamic range. In the second 
stage, a GS-PO code is selected every subframe. 
This code, when taken in conjunction with Rq(0) 
and the state of the speech decoder, determines 
the excitation gains for the subframe. The 
selection of the GS-PO code takes place after the 
two excitation vectors, L and I, have been 
chosen. 
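
The first stage can be pictured as a uniform quantizer in the dB domain: five bits give 32 levels, and 32 levels at 2 dB per step span the stated 64 dB. The sketch below is only an illustration of that arithmetic. The lowest level `min_db` and the rounding rule are assumptions, since the text does not give them.

```python
import math

STEP_DB = 2.0
LEVELS = 1 << 5                     # five bits

def quantize_energy(energy, min_db=0.0):
    """Map an average frame energy to a 5 bit index (min_db is hypothetical)."""
    if energy <= 0.0:
        return 0
    level_db = 10.0 * math.log10(energy)
    index = int(round((level_db - min_db) / STEP_DB))
    return max(0, min(LEVELS - 1, index))

def dequantize_energy(index, min_db=0.0):
    """Return the quantized energy Rq(0) for a given index."""
    return 10.0 ** ((min_db + index * STEP_DB) / 10.0)
```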

The following definitions are used to 
determine the GS-PO code. The combined 
excitation function, ex(n), is given by: 

$$ex(n) = \beta c_0(n) + \gamma c_1(n), \quad 0 \le n \le N-1 \tag{15}$$

where: 

$c_0(n)$ is the long term prediction vector, $b_L(n)$ 

$c_1(n)$ is the codevector selected from the 
VSELP codebook, $u_I(n)$ 

The energy in each excitation vector is given by: 

$$R_x(k) = \sum_{n=0}^{N-1} c_k^2(n), \quad k = 0, 1 \tag{16}$$

Let RS be the approximate residual energy at a 
given subframe. RS is a function of N, Rq(0), 
and the normalized prediction gain of the LPC 
filter. It is defined by: 

$$RS = N \, R_q(0) \prod_{i=1}^{N_p} \left( 1 - r_i^2 \right) \tag{17}$$

where $r_i$ is the i-th reflection coefficient 
corresponding to the set of direct form filter 
coefficients ($\alpha_i$'s) for the subframe. 
GS, the energy offset, is a coded parameter 
which refines the estimated value of RS. R, the 
approximate total subframe excitation energy, is 
defined as: 

$$R = \mathrm{GS} \cdot \mathrm{RS} \tag{18}$$

PO, the approximate energy contribution of the 
long term prediction vector as a fraction of the 
total excitation energy at a subframe, is defined 
to be: 

$$\mathrm{PO} = \frac{\beta^2 R_x(0)}{R}, \quad \text{where } 0 \le \mathrm{PO} \le 1 \tag{19}$$

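Thus $\beta$ and $\gamma$ are replaced by two new 
parameters: GS and PO. The transformations 
relating $\beta$ and $\gamma$ to GS and PO are given by: 

$$\beta = \sqrt{\frac{\mathrm{RS} \cdot \mathrm{GS} \cdot \mathrm{PO}}{R_x(0)}} \tag{20}$$

$$\gamma = \sqrt{\frac{\mathrm{RS} \cdot \mathrm{GS} \cdot (1 - \mathrm{PO})}{R_x(1)}} \tag{21}$$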
The GS-PO pair is vector quantized using a 
codebook of 128 vectors. The codebook was 
designed using the LBG algorithm [7], using the 
normalized weighted error as the distortion 
criterion. Figure 3 shows the distribution of the 
GS-PO codebook vectors. 



Figure 3 - PO vs GS in dB for gain codebook 
(horizontal axis: PO, 0.0 to 1.0) 

The vector which minimizes the total weighted 
error energy for the subframe is chosen from 
the GS-PO codebook. The codebook search 
procedure requires only five multiply-accumulates 
per vector evaluation. 
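
How a decoded GS-PO pair turns back into the two excitation gains can be sketched as follows. The residual energy estimate follows (17) and the total energy follows (18). The long term gain follows from solving (19) for it, and the codevector gain assumes that the remaining fraction 1 - PO of the total energy is the codevector's share. Function and argument names are hypothetical.

```python
import math

def residual_energy(n_sub, rq0, reflection):
    """RS of (17): N * Rq(0) * prod_i (1 - r_i**2)."""
    rs = n_sub * rq0
    for r in reflection:
        rs *= 1.0 - r * r
    return rs

def gains_from_gs_po(gs, po, rs, rx0, rx1):
    """Map a (GS, PO) pair to the gains of the two excitation vectors.

    rx0 and rx1 are the energies of the long term prediction vector and of
    the VSELP codevector, as in (16). Both must be nonzero.
    """
    total = gs * rs                                 # R of (18)
    beta = math.sqrt(total * po / rx0)              # from (19)
    gamma = math.sqrt(total * (1.0 - po) / rx1)     # remaining energy share
    return beta, gamma
```

Because the gains are normalized by the energies actually present at the decoder, a corrupted adaptive codebook vector changes `rx0` but leaves the decoded excitation energy close to `gs * rs`.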

International Mobile Satellite Conference, Ottawa, 1990 

This technique of quantizing the gains has a 
number of advantages. First, the coding is 
efficient. The coding of the energy once per 
frame solves the dynamic range issue. The gain 
quantization performs equally well at all signal 
levels within the range of the Rq(0) quantizer. 
With the average energy factored out, the two 
gains can be vector quantized efficiently. In 
minimizing the weighted error, the vector 
quantizer takes into account the correlation 
between the two weighted excitation vectors. 
Second, the values of GS and PO are well 
behaved as can be seen in Figure 3. Whereas the 
optimal value for $\beta$, the adaptive codebook gain, 
can occasionally get very large, PO is bounded 
by 0 and 1. Error propagation effects are also 
greatly reduced by this quantization scheme. 
Since the energies in the excitation vectors are 
used to normalize the excitation gains, previous 
channel errors affecting the energy in the 
adaptive codebook vector have very little effect 
on the decoded speech energy. Channel errors in 
the LPC coefficients are also automatically 
compensated for at the decoder in calculating the 
excitation gains. In fact, as long as the code for 
the average frame energy, Rq(0), is received 
correctly, the speech energy at the decoder will 
not be much greater than the desired energy (see 
Figure 3 for range of GS) and no "blasting" will 
occur. 

OPTIMIZATION OF BASIS VECTORS 

The basis vectors defining the VSELP 
codebook are optimized over a training database. 
The optimization criterion is the minimization of 
the total normalized weighted error. The 
normalized weighted error for each subframe can 
be expressed as a function of individual samples 
of each of the 10 basis vectors from the VSELP 
excitation codebook, given I, bL(n), p(n), the 
excitation gains, and the impulse response of 
H(z) for each subframe of the training data. The 
optimal basis vectors are computed by solving 
the 600 (10 basis vectors, 60 samples per vector) 
simultaneous equations which result from taking 
the partial derivatives of the total normalized 
weighted error function with respect to each 
sample of each basis vector and setting them 
equal to zero. Since the coder subframes are not 
independent, this procedure is iterated in a closed 
loop fashion. Figure 4 shows the improvement 
in weighted segmental SNR for each iteration. 



Initially the basis vectors are populated with 
random Gaussian sequences (iteration 0), which 
yield a weighted segmental SNR of 10.61 dB. 
The weighted segmental SNR increases to 11.28 
dB after nine iterations. The subjective quality 
improvement due to the optimization of the basis 
vectors is significant. The objective as well as 
subjective improvements are retained for speech 
data outside the training database. 


ADAPTIVE PRE AND 
POSTFILTERING 

The speech decoder creates the combined 
excitation signal, ex(n), from the long term filter 
state and the VSELP excitation codebook. The 
combined excitation is then processed by an 
adaptive "pitch" prefilter to enhance the 
periodicity of the excitation signal (see Figure 1). 
Following the adaptive pitch prefilter, the 
prefiltered excitation is applied to the LPC 
synthesis filter. After reconstructing the speech 
signal with the synthesis filter, an adaptive 
spectral postfilter is applied to further enhance 
the quality of the reconstructed speech. The pitch 
prefilter transfer function used is given by: 

$$H_p(z) = \frac{1}{1 - \xi z^{-L}} \tag{22}$$

where 

$$\xi = \epsilon \, \mathrm{Min}\left[ \beta, \sqrt{\mathrm{PO}} \right] \text{ and } \epsilon = 0.4 \tag{23}$$


Note that the periodicity enhancement is 
performed on the synthetic residual in contrast to 
pitch postfiltering which performs the 
enhancement on the synthesized speech 
waveform [8]. This significantly reduces 
artifacts in the reconstructed speech due to 
waveform discontinuities which pitch 
postfiltering sometimes introduces. For 
noninteger values of L, a polyphase FIR filter is 
used to compute the fractionally delayed 
excitation samples. Finally, to ensure unity power 
gain between the input and the output of the pitch 
prefilter, a gain scale factor is calculated to scale 
the pitch prefiltered excitation prior to applying it 
to the LPC synthesis filter. 
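
A sketch of the pitch prefilter for integer lags is given below. It applies the recursion implied by (22), derives the coefficient from the long term gain and PO as in (23), and rescales the output so that its energy over the subframe matches that of the input. The handling of noninteger lags is omitted, and keeping the unscaled outputs as filter memory is an assumption.

```python
import math

def pitch_prefilter(ex, history, lag, beta, po, eps=0.4):
    """Apply 1 / (1 - xi z^-lag) to ex(n), then rescale to unity power gain.

    history holds past prefilter outputs, oldest first, with len(history) >= lag.
    Returns (scaled output, updated history).
    """
    xi = eps * min(beta, math.sqrt(po))
    buf = list(history)
    for x in ex:
        buf.append(x + xi * buf[-lag])        # y(n) = x(n) + xi * y(n - lag)
    out = buf[len(history):]
    e_in = sum(x * x for x in ex)
    e_out = sum(y * y for y in out)
    scale = math.sqrt(e_in / e_out) if e_out > 0.0 else 1.0
    return [scale * y for y in out], buf[-len(history):]
```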

The form of the adaptive spectral postfilter 
used is: 

$$H_s(z) = \frac{1 - \sum_{i=1}^{10} \eta_i z^{-i}}{1 - \sum_{i=1}^{10} \nu^i \alpha_i z^{-i}}, \quad 0 \le \nu \le 1 \tag{24}$$

where the $\alpha_i$'s are the coefficients of the 
synthesis filter. To derive the numerator, the 
$\nu^i \alpha_i$ coefficients are converted to the 
autocorrelation domain (the autocorrelation of the 
impulse response of the all pole filter 
corresponding to the denominator of (24) is 
calculated for lags 0 through 10). A binomial 
window is then applied to the autocorrelation 
sequence [9] and the numerator polynomial 
coefficients are calculated from the modified 
autocorrelation sequence via the Levinson 
recursion. This postfilter is similar to that 
proposed by Gersho and Chen [10]. However, 
the use of the autocorrelation domain windowing 
results in a frequency response for the numerator 
that tracks the general shape and slope of the 
denominator's frequency response more closely. 
To increase postfiltered speech "brightness", an 
additional first order filter is used of the form: 

$$H_B(z) = 1 - u z^{-1} \tag{25}$$

The following postfilter parameter values are 
used: $\nu = 0.8$, $B_{eq} = 1200$ Hz, $u = 0.4$. Note that $B_{eq}$ 
is the bandwidth expansion factor which 
specifies the degree of smoothing which is 
performed on the denominator to generate the 
numerator. 
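
The numerator derivation can be sketched in three steps: compute the autocorrelation of the impulse response of the all-pole denominator, multiply it by a lag window, and run the Levinson recursion on the result. The sketch below leaves the lag window as an input, because the mapping from the 1200 Hz bandwidth expansion factor to binomial window values is not given in the text. The truncation length of the impulse response is also an assumption.

```python
def impulse_response(coeffs, length):
    """Impulse response of 1 / (1 - sum_i coeffs[i-1] z^-i)."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for i, c in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc += c * h[n - i]
        h.append(acc)
    return h

def autocorrelation(h, max_lag):
    return [sum(h[n] * h[n + k] for n in range(len(h) - k))
            for k in range(max_lag + 1)]

def levinson(r, order):
    """Levinson recursion: predictor coefficients a_1..a_order from r[0..order]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def postfilter_numerator(alpha, nu, lag_window, length=512):
    """Numerator coefficients of (24) from the windowed autocorrelation."""
    denom = [a * nu ** i for i, a in enumerate(alpha, start=1)]
    r = autocorrelation(impulse_response(denom, length), len(alpha))
    r = [w * x for w, x in zip(lag_window, r)]
    return levinson(r, len(alpha))
```

With an all-ones window the numerator reproduces the denominator, so the postfilter does nothing. A window that zeroes every lag above 0 yields a flat numerator. Windows between these extremes give a smoothed version of the denominator's spectral envelope.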

As in the case of the pitch prefilter, a method 
of automatic gain control is needed to ensure 
unity gain through the spectral postfilter. A scale 
factor is computed for the subframe in the same 
manner as was done for the pitch prefilter. In the 
case of the spectral postfilter, this scale factor is 
not used directly. To avoid discontinuities in the 
output waveform, the scale factor is passed 
through a first order low pass filter before being 
applied to the postfilter output. 
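
The smoothed gain control can be sketched as below. The target scale factor is the square root of the ratio of input to output energy over the subframe, and a one-pole low pass filter moves the applied gain toward that target. The smoothing constant and the per-sample update are assumptions, since the text gives neither.

```python
import math

def smoothed_agc(pre, post, prev_gain, smoothing=0.9):
    """Scale the postfilter output so its power tracks that of its input.

    pre  : postfilter input for the subframe
    post : postfilter output for the subframe
    Returns (scaled output, gain to carry into the next subframe).
    """
    e_pre = sum(x * x for x in pre)
    e_post = sum(y * y for y in post)
    target = math.sqrt(e_pre / e_post) if e_post > 0.0 else 1.0
    out, gain = [], prev_gain
    for y in post:
        # First order low pass filter applied to the scale factor.
        gain = smoothing * gain + (1.0 - smoothing) * target
        out.append(gain * y)
    return out, gain
```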


CONCLUSIONS 

A high quality 4.8 kbps speech coder has 
been described. The complexity of the coder is 
low enough so that it can be implemented on a 
single DSP device such as the Motorola 
DSP56000. In addition to its very high speech 
quality, the parameters are inherently robust to 
channel errors and can be error protected very 
efficiently. 